Clicker Questions
to go along with
Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton
Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani
- The reason to take random samples is:1
- to make cause and effect conclusions
- to get as many variables as possible
- it’s easier to collect a large dataset
- so that the data are a good representation of the population
- I have no idea why one would take a random sample
- The reason to allocate/assign explanatory variables is:2
- to make cause and effect conclusions
- to get as many variables as possible
- it’s easier to collect a large dataset
- so that the data are a good representation of the population
- I have no idea what you mean by “allocate/assign” (or “explanatory variable” for that matter)
- Approximately how big is a tweet?3
- 0.01Kb
- 0.1Kb
- 1Kb
- 100Kb
- 1000Kb = 1Mb
- \(R^2\) measures:4
- the proportion of variability in vote margin as explained by tweet share.
- the proportion of variability in tweet share as explained by vote margin.
- how appropriate the linear part of the linear model is.
- whether or not particular variables should be included in the model.
- R / R Studio / Quarto5
- all good
- started, progress is slow and steady
- started, very stuck
- haven’t started yet
- what do you mean by “R”?
- Git / GitHub6
- all good
- started, progress is slow and steady
- started, very stuck
- haven’t started yet
- what do you mean by “Git”?
- Which of the following includes talking to the remote version of GitHub?7
- changing your name (updating the YAML)
- committing the file(s)
- pushing the file(s)
- some of the above
- all of the above
- What is the error?8
- poor assignment operator
- unmatched quotes
- improper syntax for function argument
- invalid object name
- no mistake
- What is the error?9
- poor assignment operator
- unmatched quotes
- improper syntax for function argument
- invalid object name
- no mistake
- What is the error?10
- poor assignment operator
- unmatched quotes
- improper syntax for function argument
- invalid object name
- no mistake
- What is the error?11
- poor assignment operator
- unmatched quotes
- improper syntax for function argument
- invalid object name
- no mistake
- What is the error?12
- poor assignment operator
- unmatched quotes
- improper syntax for function argument
- invalid object name
- no mistake
- Do you keep a calendar / schedule / planner?13
- Yes
- No
- Do you keep a calendar / schedule / planner? If you answered “Yes” …14
- Yes, on Google Calendar
- Yes, on Calendar for macOS
- Yes, on Outlook for Windows
- Yes, in some other app
- Yes, by hand
- Where should I put things I’ve created for the HW (e.g., data, .ics file, etc.)15
- Upload into remote GitHub directory
- In the local folder which also has the R project
- In my Downloads
- Somewhere on my Desktop
- In my Home directory
- The goal of making a figure is…16
- To draw attention to your work.
- To facilitate comparisons.
- To provide as much information as possible.
- A good reason to make a particular choice of a graph is:17
- Because the journal / field has particular expectations for how the data are presented.
- Because some variables naturally fit better on some graphs (e.g., numbers on scatter plots).
- Because that graphic displays the message you want as optimally as possible.
- Why are the points orange?18
- R translates “navy” into orange.
- color must be specified in geom_point()
- color must be specified outside the aes() function
- the default plot color is orange
- Why are the dots blue and the lines colored?19
- dot color is given as “navy”, line color is given as wday.
- both colors are specified in the ggplot() function.
- dot coloring takes precedence over line coloring.
- line coloring takes precedence over dot coloring.
- Setting vs. Mapping. If I want information to be passed to all data points (not variable):20
- map the information inside the aes() function.
- set the information outside the aes() function.
- The Snow figure was most successful at:21
- making the data stand out
- facilitating comparison
- putting the work in context
- simplifying the story
- The Challenger figure(s) was(were) least successful at:22
- making the data stand out
- facilitating comparison
- putting the work in context
- simplifying the story
- The biggest difference between Snow and the Challenger was:23
- The amount of information portrayed.
- One was better at displaying cause.
- One showed the relevant comparison better.
- One was more artistic.
- Caffeine and Calories. What was the biggest concern over the average value axes?24
- It isn’t at the origin.
- They should have used all the data possible to find averages.
- There wasn’t a random sample.
- There wasn’t a label explaining why the axes were where they were.
- What is wrong with the following code?25
- should only be one =
- Sydney should be lower case
- name should not be in quotes
- use mutate instead of filter
- babynames in wrong place
- Which data represents the ideal format for ggplot2 and dplyr?26
(a)

| year | Algeria | Brazil | Colombia |
|---|---|---|---|
| 2000 | 7 | 12 | 16 |
| 2001 | 9 | 14 | 18 |

(b)

| country | Y2000 | Y2001 |
|---|---|---|
| Algeria | 7 | 9 |
| Brazil | 12 | 14 |
| Colombia | 16 | 18 |

(c)

| country | year | value |
|---|---|---|
| Algeria | 2000 | 7 |
| Algeria | 2001 | 9 |
| Brazil | 2000 | 12 |
| Brazil | 2001 | 14 |
| Colombia | 2000 | 16 |
| Colombia | 2001 | 18 |
- Each of the statements except one will accomplish the same calculation. Which one does not match?27
#(a)
babynames |>
  group_by(year, sex) |>
  summarize(totalBirths = sum(num))
#(b)
group_by(babynames, year, sex) |>
  summarize(totalBirths = sum(num))
#(c)
group_by(babynames, year, sex) |>
  summarize(totalBirths = mean(num))
#(d)
temp <- group_by(babynames, year, sex)
summarize(temp, totalBirths = sum(num))
#(e)
summarize(group_by(babynames, year, sex),
          totalBirths = sum(num))
- Fill in Q1.28
- filter()
- arrange()
- select()
- mutate()
- group_by()
- Fill in Q2.29
- (year, sex)
- (year, name)
- (year, num)
- (sex, name)
- (sex, num)
- Fill in Q3.30
- n_distinct(name)
- n_distinct(n)
- sum(name)
- sum(num)
- mean(num)
- Running the code.31
babynames <- babynames::babynames |>
  rename(num = n)
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  # just the Janes and Marys
  group_by(name, year) |>
  # for each year for each name
  summarize(total = sum(num))
# A tibble: 276 × 3
# Groups: name [2]
name year total
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(number = sum(num))
# A tibble: 276 × 3
# Groups: name [2]
name year number
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(n_distinct(name))
# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(name)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(n_distinct(num))
# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(num)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
Error in `summarize()`:
ℹ In argument: `sum(name)`.
ℹ In group 1: `name = "Jane"` and `year = 1880`.
Caused by error in `base::sum()`:
! invalid 'type' (character) of argument
# A tibble: 276 × 3
# Groups: name [2]
name year `mean(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
# A tibble: 276 × 3
# Groups: name [2]
name year `median(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
- Fill in Q1.32
- gdp
- year
- gdpval
- country
- -country
- Fill in Q2.33
- gdp
- year
- gdpval
- country
- -country
- Fill in Q3.34
- gdp
- year
- gdpval
- country
- -country
- Response to stimulus (in ms) after only 3 hrs of sleep for 9 days. You want to make a plot with the subject’s reaction time (y-axis) vs the number of days of sleep restriction (x-axis) using the following ggplot() code. Which data frame should you use?35
- use raw data
- use pivot_wider() on raw data
- use pivot_longer() on raw data
# A tibble: 18 × 11
Subject day_0 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 308 250. 259. 251. 321. 357. 415. 382. 290. 431. 466.
2 309 223. 205. 203. 205. 208. 216. 214. 218. 224. 237.
3 310 199. 194. 234. 233. 229. 220. 235. 256. 261. 248.
4 330 322. 300. 284. 285. 286. 298. 280. 318. 305. 354.
5 331 288. 285 302. 320. 316. 293. 290. 335. 294. 372.
6 332 235. 243. 273. 310. 317. 310 454. 347. 330. 254.
7 333 284. 290. 277. 300. 297. 338. 332. 349. 333. 362.
8 334 265. 276. 243. 255. 279. 284. 306. 332. 336. 377.
9 335 242. 274. 254. 271. 251. 255. 245. 235. 236. 237.
10 337 312. 314. 292. 346. 366. 392. 404. 417. 456. 459.
11 349 236. 230. 239. 255. 251. 270. 282. 308. 336. 352.
12 350 256. 243. 256. 256. 269. 330. 379. 363. 394. 389.
13 351 251. 300. 270. 281. 272. 305. 288. 267. 322. 348.
14 352 222. 298. 327. 347. 349. 353. 354. 360. 376. 389.
15 369 272. 268. 257. 278. 315. 317. 298. 348. 340. 367.
16 370 225. 235. 239. 240. 268. 344. 281. 348. 365. 372.
17 371 270. 272. 278. 282. 279. 285. 259. 305. 351. 369.
18 372 269. 273. 298. 311. 287. 330. 334. 343. 369. 364.
sleep_long <- sleep_wide %>%
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               values_to = "reaction_time")
sleep_long
# A tibble: 180 × 3
Subject day reaction_time
<dbl> <chr> <dbl>
1 308 0 250.
2 308 1 259.
3 308 2 251.
4 308 3 321.
5 308 4 357.
6 308 5 415.
7 308 6 382.
8 308 7 290.
9 308 8 431.
10 308 9 466.
# ℹ 170 more rows
- Consider band members from the Beatles and the Rolling Stones. Who is removed in a right_join()?36
- Mick
- John
- Paul
- Keith
- Impossible to know
- Consider band members from the Beatles and the Rolling Stones. Which variables are removed in a right_join()?37
- name
- band
- plays
- none of them
- What happens to Mick’s plays variable in a full_join()?38
- Mick is removed
- changes to guitar
- changes to bass
- NA
- NULL
- Consider the addTen() function. The following output is a result of which map_*() call?39
- map(c(1,4,7), addTen)
- map_dbl(c(1,4,7), addTen)
- map_chr(c(1,4,7), addTen)
- map_lgl(c(1,4,7), addTen)
[1] "11.000000" "14.000000" "17.000000"
- Which of the following input is allowed?40
- map(c(1, 4, 7), addTen)
- map(list(1, 4, 7), addTen)
- map(data.frame(a=1, b=4, c=7), addTen)
- some of the above
- all of the above
- Which of the following produces a different output?41
- map(c(1, 4, 7), addTen)
- map(c(1, 4, 7), ~addTen(.x))
- map(c(1, 4, 7), ~addTen)
- map(c(1, 4, 7), function(hi) (hi + 10))
- map(c(1, 4, 7), ~(.x + 10))
- What will the following code output?42
- 3 random normals
- 6 random normals
- 18 random normals
Footnotes
- so that the data are a good representation of the population
- to make cause and effect conclusions
- about 0.1Kb. Turns out that 3.5 billion tweets * 0.1Kb = 350Gb (0.35 Tb). My laptop is pretty good, and it has 36 Gb of memory (RAM) and 4 Tb of storage. It would not be able to work with 3.5 billion tweets.
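The arithmetic in this answer can be checked directly. The following is a minimal sketch in R, assuming decimal units (1 Gb = 1,000,000 Kb) and using only the figures quoted above; the variable names are illustrative.

```r
# Figures quoted in the answer above; variable names are illustrative.
n_tweets     <- 3.5e9   # 3.5 billion tweets
kb_per_tweet <- 0.1     # approximate size of one tweet, in Kb

# Assumption: decimal units, so 1 Gb = 1e6 Kb and 1 Tb = 1e3 Gb.
total_gb <- n_tweets * kb_per_tweet / 1e6
total_tb <- total_gb / 1e3

ram_gb <- 36            # memory (RAM) of the laptop described above

# Compare the total size of the tweets with the available memory.
fits_in_memory <- total_gb <= ram_gb
```

Under these assumptions the total is far larger than the available memory, which is the point of the answer.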
- the proportion of variability in vote margin as explained by tweet share.
wherever you are, make sure you are communicating with me when you have questions!
wherever you are, make sure you are communicating with me when you have questions!
- pushing the file(s)
- poor assignment operator
- invalid object name
- unmatched quotes
- no mistake
- improper syntax for a function argument
- I mean, the right answer has to be Yes, right!??!
no right answer here!
- In the local folder which also has the R project. It could be on the Desktop or the Home directory, but it must be in the same place as the R project. Do not upload files to the remote GitHub directory or you will find yourself with two different copies of the files.
Yes! All the responses are reasons to make a figure.
- Because that graphic displays the message you want as optimally as possible.
- color must be specified outside the aes() function
- dot color is specified as “navy”, line color is specified as wday.
- set the information outside the aes() function
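The setting-versus-mapping distinction behind the last three answers can be sketched with a small made-up data frame. The data and object names below are hypothetical and are not the code from the original questions; `layer_data()` is used only to inspect which colour ggplot2 ends up drawing.

```r
library(ggplot2)

# Hypothetical toy data, not the data used in the questions.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 7))

# Mapping: inside aes(), "navy" is treated as a one-level variable,
# so ggplot2 picks a colour from its default discrete palette.
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = "navy"))

# Setting: outside aes(), "navy" is applied as a fixed colour to every point.
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "navy")

# Inspect the colours that each layer actually uses.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours    <- unique(layer_data(p_set)$colour)
```

Only the second plot draws navy points; in the first, the word "navy" ends up as a legend label rather than as a colour.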
Answers may vary. I’d say c. putting the work in context. Others might say b. facilitating comparison or d. simplifying the story. However, I don’t think a correct answer is a. making the data stand out.
- making the data stand out
- One showed the relevant comparison better.
- a. It isn’t at the origin, in combination with d. There wasn’t a label explaining why the axes were where they were. The story associated with the average value axes is not clear to the reader.
- babynames in wrong place
- Table c is best because the columns allow us to work with each of the variables separately.
- (c) does something different because it takes the mean() (average) instead of the sum(). The other commands compute the total number of births broken down by year and sex.
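The equivalence of the piped and nested forms, and the difference made by mean(), can be checked on a tiny example. This is a sketch on a made-up table (toy_names and its values are hypothetical, standing in for babynames); `.groups = "drop"` is added only to silence the grouping message.

```r
library(dplyr)

# Hypothetical stand-in for babynames with the same column names.
toy_names <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal"),
  num  = c(10, 20, 5, 30, 15)
)

# Piped form, as in (a).
totals_pipe <- toy_names |>
  group_by(year, sex) |>
  summarize(totalBirths = sum(num), .groups = "drop")

# Nested form, as in (e): same calculation, written inside out.
totals_nested <- summarize(group_by(toy_names, year, sex),
                           totalBirths = sum(num), .groups = "drop")

# mean() instead of sum(), as in (c): a different calculation.
means <- toy_names |>
  group_by(year, sex) |>
  summarize(totalBirths = mean(num), .groups = "drop")
```

The two sum() versions agree row for row, while the mean() version differs wherever a group has more than one row.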
filter()
(year, name)
sum(num)
Running the different code chunks with relevant output.
-country
year
gdpval (if possible, it is a good idea to name variables something different from the name of the data frame)
- use pivot_longer() on raw data. The reference to the study is: Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research 12, 1–12.
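To see why the long format suits the plot, here is a minimal sketch on two subjects and two days, with values rounded from the table in the question. The object names and the ggplot() call are illustrative assumptions, since the original plotting code is not shown; `names_transform` is used so that day is stored as a number rather than as text.

```r
library(tidyr)
library(ggplot2)

# Two subjects and two days, rounded from the table in the question;
# object names are illustrative.
sleep_small <- tibble::tibble(
  Subject = c(308, 309),
  day_0   = c(250, 223),
  day_1   = c(259, 205)
)

sleep_small_long <- sleep_small |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               names_transform = list(day = as.integer),  # store day as a number
               values_to = "reaction_time")

# One row per subject per day, so day and reaction_time map directly to x and y.
p <- ggplot(sleep_small_long,
            aes(x = day, y = reaction_time, group = Subject)) +
  geom_line()
```

In the wide table there is no single column to put on either axis, which is why the raw data cannot be plotted this way.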
- Mick
- none of them (the default is to retain all the variables)
NA (it would be NULL in SQL)
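The three join answers above can be checked with a short sketch. It assumes the questions refer to tables like the band_members and band_instruments example data that ship with dplyr; the questions themselves do not show their data, so treat that as an assumption.

```r
library(dplyr)

# Assumption: data like dplyr's built-in band_members (name, band)
# and band_instruments (name, plays).
right <- right_join(band_members, band_instruments, by = "name")
full  <- full_join(band_members, band_instruments, by = "name")

# right_join() keeps every row of the second table, so a member with no
# instrument on record drops out, but no column is removed.
right_names <- right$name
right_vars  <- names(right)

# full_join() keeps everyone and fills the unmatched cells with NA.
mick_plays <- full$plays[full$name == "Mick"]
```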
map_chr(c(1,4,7), addTen), because the output is in quotes: the values are strings, not numbers.
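The type of the result is what separates the map_*() variants. Below is a minimal sketch, assuming a hypothetical definition of addTen() (the questions do not show it). The character version converts explicitly with as.character() so that it does not depend on how a given purrr version coerces numbers to text.

```r
library(purrr)

# Hypothetical definition; the questions do not show addTen() itself.
addTen <- function(x) x + 10

as_list <- map(c(1, 4, 7), addTen)      # a list of three numbers
as_dbl  <- map_dbl(c(1, 4, 7), addTen)  # a numeric vector

# Explicit conversion to character, so the result is a vector of strings.
as_chr  <- map_chr(c(1, 4, 7), \(x) as.character(addTen(x)))
```

The exact text of the strings (for example "11" versus "11.000000") depends on how the conversion is done, so it may not match the output shown in the question.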
- all of the above. The map() function allows vectors, lists, and data frames as input.
map(c(1, 4, 7), ~addTen). The ~ acts on functions that do not have their own name or that are defined by function(...). By adding the argument (.x) we’ve expanded the addTen() function, and so it needs a ~. The addTen() function all alone does not use a ~.
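The last two answers can be checked side by side. This is a sketch under the assumption that addTen() simply adds 10 to its input (a hypothetical definition, since the original is not shown).

```r
library(purrr)

# Hypothetical definition; the questions do not show addTen() itself.
addTen <- function(x) x + 10

# Different inputs: vector, list, and data frame (one element per column).
from_vector <- map(c(1, 4, 7), addTen)
from_list   <- map(list(1, 4, 7), addTen)
from_df     <- map(data.frame(a = 1, b = 4, c = 7), addTen)

# Different ways of supplying the function, all calling it on each element.
by_formula <- map(c(1, 4, 7), ~ addTen(.x))
by_lambda  <- map(c(1, 4, 7), function(hi) (hi + 10))

# ~addTen never calls the function: each element is the function itself.
not_called <- map(c(1, 4, 7), ~ addTen)
```

The data frame version returns the same values but keeps the column names a, b, and c on the resulting list.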
- 6 random normals (1 with mean 1, sd 3; 2 with mean 3, sd 1; 3 with mean 47, sd 10)
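The code for this question is not reproduced above. One call that would be consistent with the answer is sketched below; it is an assumption, not necessarily the original code. pmap() walks through the three argument vectors in parallel and passes each triple to rnorm().

```r
library(purrr)

set.seed(47)  # arbitrary seed so the draws are reproducible

# Hypothetical call consistent with the answer; the original code is not shown.
draws <- pmap(
  list(n = c(1, 2, 3), mean = c(1, 3, 47), sd = c(3, 1, 10)),
  rnorm
)

# Number of values produced by each of the three calls, and in total.
draws_per_call <- lengths(draws)
total_draws    <- sum(draws_per_call)
```

Each list element comes from one call to rnorm(), so the element lengths follow the n argument and add up to six values.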